We will work with the data frame flights, which is included in the nycflights13 package. To get started, load tidyverse and nycflights13:
library(tidyverse)
library(nycflights13)
You may need to install nycflights13. Run install.packages("nycflights13") in your RStudio Console pane.
The nycflights13 package contains a data frame, flights, with on-time data for all flights that departed NYC (i.e., JFK, LGA, or EWR) in 2013. Take a few minutes to examine the variables and their descriptions by running ?flights in your RStudio Console pane.
flights
The flights object is a tibble. To see every variable at once, use the glimpse() function.
glimpse(flights)
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,…
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600,…
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", …
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, …
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, …
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,…
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, …
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, …
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 …
Before you get started, take a few minutes to refresh on some of R’s comparison operators detailed below.
| Operator | Description |
|---|---|
| > | greater than |
| < | less than |
| >= | greater than or equal to |
| <= | less than or equal to |
| == | equal to |
| != | not equal to |
| & | and (ex: (5 > 7) & (6*7 == 42) will return the value FALSE) |
| \| | or (ex: (5 > 7) \| (6*7 == 42) will return the value TRUE) |
| %in% | group membership |
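When & and | appear in the same condition, R evaluates & first, which can change the meaning of a filter. The following is a minimal sketch of that behavior; the vectors dest and month are hypothetical values made up for illustration, not data from flights.

```r
# Hypothetical values chosen only for illustration
dest  <- c("DTW", "ORD", "DTW", "ORD")
month <- c(1, 1, 6, 6)

# & is evaluated before |, so this reads as:
# dest == "DTW" | (dest == "ORD" & month == 1)
no_parens <- dest == "DTW" | dest == "ORD" & month == 1

# Parentheses make the intended grouping explicit
with_parens <- (dest == "DTW" | dest == "ORD") & month == 1

no_parens
with_parens
```

Without parentheses, the June DTW entry is kept; with parentheses, only the two January entries are TRUE.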
To evaluate group membership:
# Generating the group:
set.seed(634789234)
die.out <- sample(x = 1:6, size = 10, replace = TRUE)
die.out
# Checking for group membership:
die.out %in% c(3, 4)
c(3, 4) %in% die.out
die.out %in% c(1)
c(1) %in% die.out
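The order of the operands matters: %in% returns one logical value for each element on its left-hand side. A small sketch, assuming a hand-written vector x (hypothetical rolls, not the output of the sample() call above):

```r
# Hypothetical rolls, written by hand for illustration
x <- c(2, 5, 3, 3, 6)

# One result per element of x: is each roll a 3 or a 4?
rolls_in_set <- x %in% c(3, 4)

# One result per element of c(3, 4): does each value appear in x?
set_in_rolls <- c(3, 4) %in% x

rolls_in_set
set_in_rolls

# sum() counts the TRUE values, i.e. how many rolls were a 3 or a 4
n_hits <- sum(rolls_in_set)
n_hits
```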
Package dplyr is based on the concept of functions as verbs that manipulate data frames.
| Function | Action and purpose |
|---|---|
| filter() | choose rows matching a set of criteria |
| slice() | choose rows using indices |
| select() | choose columns by name |
| pull() | grab a column as a vector |
| rename() | rename specific columns |
| arrange() | reorder rows |
| mutate() | add new variables to the data frame |
| transmute() | create a new data frame with new variables |
| distinct() | filter for unique rows |
| sample_n() / sample_frac() | randomly sample rows |
| summarise() | reduce variables to values |
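These verbs are usually chained with the pipe operator %>%, which passes the result on its left as the first argument of the function on its right. The sketch below strings four verbs together; the tibble toy and its values are hypothetical and are not taken from nycflights13.

```r
library(dplyr)

# Hypothetical toy data, not taken from nycflights13
toy <- tibble(
  carrier   = c("UA", "AA", "UA", "DL"),
  dep_delay = c(5, -2, 30, 0),
  arr_delay = c(10, -5, 20, 3)
)

result <- toy %>%
  filter(dep_delay >= 0) %>%                # keep rows that did not leave early
  mutate(gain = arr_delay - dep_delay) %>%  # add a new column
  arrange(desc(gain)) %>%                   # largest gain first
  select(carrier, gain)                     # keep only two columns

result
```

Each step receives a data frame and returns a data frame, which is what makes the verbs composable.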
Use the %>% operator and any of the functions in package dplyr to answer the following questions.
Filter flights for those in January with a destination of Detroit Metro (DTW) or Chicago O’Hare (ORD).
# %in% avoids a precedence trap: & is evaluated before |
flights %>% filter(dest %in% c('DTW', 'ORD'), month == 1)
Filter flights for those before April with a destination that is not Detroit Metro (DTW) and had an origin of JFK.
# "before April" means months 1 to 3, so use month < 4
flights %>% filter(dest != 'DTW', month < 4, origin == 'JFK')
Slice rows 1, 3, 7, and 20 from flights.
flights %>% slice(c(1, 3, 7, 20))
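Arrange flights in descending order of distance, and then in descending order of departure delay, using desc().
# Inside dplyr verbs, refer to columns by name rather than with flights$
flights %>% arrange(desc(distance))
flights %>% arrange(desc(dep_delay))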
Select the columns month, origin, and dest from flights.
flights %>% select(month, origin, dest)
Add a new variable to flights called gain, where gain is the arrival delay minus the departure delay.
# mutate() adds the column and keeps the result a tibble
flights %>% mutate(gain = arr_delay - dep_delay)
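flights_EWR = flights %>% filter(origin == 'EWR')
flights_EWR
# summarise() needs a summary function; the mean is assumed here as the intended statistic
flights_EWR %>% summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE))
flights_EWR %>% summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE))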
Grouping adds substantially to the power of the dplyr functions. We will focus on using summarise() with group_by(), but grouping can also be used with other dplyr functions.
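As a rough illustration of summarise() with group_by(), the sketch below computes one row of summaries per group. The tibble toy and its values are hypothetical; only the column names mirror those in flights.

```r
library(dplyr)

# Hypothetical toy data; column names mirror flights, values are made up
toy <- tibble(
  origin    = c("EWR", "EWR", "JFK", "JFK", "LGA"),
  dep_delay = c(10, NA, 4, 6, -3)
)

by_origin <- toy %>%
  group_by(origin) %>%
  summarise(
    n_flights      = n(),                          # rows in each group
    mean_dep_delay = mean(dep_delay, na.rm = TRUE) # ignore missing delays
  )

by_origin
```

Without group_by(), the same summarise() call would return a single row for the whole data frame.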
flights_sorted = subset(flights, dest == 'ORD' & carrier == 'UA')
flights_sorted
length(flights_sorted$arr_delay)
[1] 6984
mean(flights_sorted$arr_delay)
[1] NA
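The NA result arises because mean() returns NA whenever its input contains a missing value. A minimal sketch of the behavior and the usual remedy, using a hypothetical vector rather than the actual arr_delay column:

```r
# Hypothetical delays; NA stands for a missing value
arr_delay <- c(12, NA, -4, 7)

# A single missing value makes the mean missing
plain_mean <- mean(arr_delay)

# Count how many values are missing
n_missing <- sum(is.na(arr_delay))

# na.rm = TRUE drops missing values before averaging
clean_mean <- mean(arr_delay, na.rm = TRUE)

plain_mean
n_missing
clean_mean
```

The same na.rm = TRUE argument can be passed to mean() inside summarise().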